Abstract



There is increasing interest in broad appli- 
cation areas in defining flexible joint mod- 
els for data having a variety of measurement 
scales, while also allowing data of complex 
types, such as functions, images and docu- 
ments. We consider a general framework for 
nonparametric Bayes joint modeling through 
mixture models that incorporate dependence 
across data types through a joint mixing mea- 
sure. The mixing measure is assigned a novel 
infinite tensor factorization (ITF) prior that 
allows flexible dependence in cluster alloca- 
tion across data types. The ITF prior is for- 
mulated as a tensor product of stick-breaking 
processes. Focusing on a convenient special 
case corresponding to a Parafac factorization, 
we provide basic theory justifying the flex- 
ibility of the proposed prior and resulting 
asymptotic properties. Focusing on ITF mix- 
tures of product kernels, we develop a new 
Gibbs sampling algorithm for routine imple- 
mentation relying on slice sampling. The 
methods are compared with alternative joint 
mixture models based on Dirichlet processes 
and related approaches through simulations 
and real data applications. 



1 INTRODUCTION 

There has been considerable recent interest in joint 
modeling of data of widely disparate types, including 
not only real numbers, counts and categorical data but 
also more complex objects, such as functions, shapes, 
and images. We refer to this general problem as mixed 
domain modeling (MDM), and major objectives include
exploring dependence between the data types,
co-clustering, and prediction. Until recently, the
emphasis in the literature was almost entirely on
parametric hierarchical models for joint modeling of mixed
discrete and continuous data without considering more
complex object data. The two main strategies are to
rely on underlying Gaussian variable models (Muthen
1984) or exponential family models, which incorporate
shared latent variables in models for the different
outcomes. Recently, there have been a number of articles
using these models as building blocks in discrete mixture
models relying on Dirichlet processes (DPs) or closely
related variants (Cai et al. 2011; Song et al. 2009; Yang &
Dunson 2010). DP mixtures for mixed domain modeling
were also considered by Hannah et al. (2011), Shahbaba &
Neal (2009) and Dunson & Bhattacharya (2010),
among others. Related approaches are increasingly
widely used in broad machine learning applications,
such as for joint modeling of images and captions (Li
et al. 2011), and have rapidly become a standard tool
for MDM.

Although such joint Dirichlet process mixture models
(DPMs) are quite flexible, and can accommodate
joint modeling with complicated objects such as
functions (Bigelow & Dunson 2009), they suffer from a key
disadvantage in relying on conditional independence
given a single latent cluster index. For example, as
motivated in Dunson (2009, 2010), the DP and related
approaches imply that two subjects $i$ and $i'$ are either
allocated to the same cluster ($C_i = C_{i'}$) globally for all
their parameters or are not clustered. The soft
probabilistic clustering of the DP is appealing in leading
to substantial dimensionality reduction, but a single
global cluster index carries several substantial practical
disadvantages. Firstly, to realistically characterize
joint distributions across many variables, it may be
necessary to introduce many clusters, degrading the
performance in the absence of large sample sizes.
Secondly, as the DP and the intrinsic Bayes penalty for
model complexity both favor allocation to few clusters,
one may over-cluster and hence obscure important
differences across individuals, leading to misleading
inferences and poor predictions. Often, the posterior
for the clusters may be largely driven by certain
components of the data, particularly when more data
are available for those components, at the expense of
poorly characterizing components for which less, or
more variable, data are available.

To overcome these problems we propose Infinite Ten- 
sor Factorization (ITF) models, which can be viewed 
as next generation extensions of the DP to accommo- 
date dependent object type-specific clustering. Instead 
of relying on a single unknown cluster index, we pro- 
pose separate but dependent cluster indices for each 
of the data types whose joint distribution is given by 
a random probability tensor. We use this to build 
a general framework for hierarchical modeling. The 
other main contribution of this article is a
general extension of blocked slice sampling, which
allows for an efficient and straightforward algorithm for
sampling from the posterior distributions arising with
the ITF, with potential application in other multivariate
settings with infinite tensors, without resorting to
finite truncation of the infinitely many possible levels.

2 PRELIMINARIES 

We start by considering a simple bivariate setting
$p = 2$ in which data for subject $i$ consist of
$y_i = (y_{i1}, y_{i2})' \in \mathcal{Y}$, with
$\mathcal{Y} = \mathcal{Y}_1 \otimes \mathcal{Y}_2$, $y_{i1} \in \mathcal{Y}_1$, and
$y_{i2} \in \mathcal{Y}_2$ for $i = 1, \ldots, n$. We desire a joint model in
which $y_i \sim f$, with $f$ a probability measure characterizing
the joint distribution. In particular, letting $\mathcal{B}(\mathcal{Y})$
denote an appropriate sigma-algebra of subsets of $\mathcal{Y}$,
$f$ assigns probability $f(B)$ to each $B \in \mathcal{B}(\mathcal{Y})$. We
assume $\mathcal{Y}$ is a measurable Polish space, as we would like
to keep the domains $\mathcal{Y}_1$ and $\mathcal{Y}_2$ as general as possible
to encompass not only subsets of Euclidean space and
the set of natural numbers but also function spaces
that may arise in modeling curves, surfaces, shapes
and images. In many cases, it is not at all
straightforward to define a parametric joint measure, but there
is typically a substantial literature suggesting various
choices for the marginals $y_{i1} \sim f_1$ and $y_{i2} \sim f_2$
separately.

If we only had data for the $j$th variable, $y_{ij}$, then one
possible strategy is to use a mixture model in which

$$f_j(B) = \int_{\Theta_j} \mathcal{K}_j(B; \theta_j)\, dP_j(\theta_j), \quad B \in \mathcal{B}(\mathcal{Y}_j), \tag{1}$$

where $\mathcal{K}_j(\cdot\,; \theta_j)$ is a probability measure on $\{\mathcal{Y}_j, \mathcal{B}(\mathcal{Y}_j)\}$
indexed by parameters $\theta_j \in \Theta_j$, $\mathcal{K}_j$ obeys a parametric
law (e.g., Gaussian), and $P_j$ is a probability measure
over $\{\Theta_j, \mathcal{B}(\Theta_j)\}$. A nonparametric Bayesian approach
is obtained by treating $P_j$ as a random probability
measure and choosing an appropriate prior. By
far the most common choice is the Dirichlet process
(Ferguson 1973), which lets $P_j \sim \mathrm{DP}(\alpha P_{0j})$. Under
the Sethuraman (1994) stick-breaking representation,
one then obtains,

$$f_j(B) = \sum_{h=1}^{\infty} \pi_h\, \mathcal{K}_j(B; \theta_h^*), \quad \pi_h = V_h \prod_{l<h} (1 - V_l), \quad \theta_h^* \sim P_{0j}, \tag{2}$$

and $V_h \sim \mathrm{Be}(1, \alpha)$, so that $f_j$ can be expressed as a
discrete mixture. This discrete mixture structure
implies the following simple hierarchical representation,
which is crucially used for efficient computation:

$$y_{ij} \sim \mathcal{K}_j(\theta^*_{C_{ij}}), \quad \theta_h^* \sim P_{0j}, \quad \mathrm{pr}(C_{ij} = h) = \pi_h, \tag{3}$$

where $C_{ij}$ is a cluster index for subject $i$. The great
success of this model is largely attributable to the
divide and conquer structure in which one allocates
subjects to clusters probabilistically, and then can treat
the observations within each cluster as separate
instantiations of a parametric model. In addition, there is a
literature showing appealing properties, such as
minimax optimal adaptive rates of convergence for DPMs
of Gaussians (Shen & Ghosal 2011; Tokdar 2011).
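The stick-breaking weights in (2) can be simulated directly. The following is a minimal Python sketch, assuming a finite number of sticks is generated for illustration only; the function name `stick_breaking_weights` and the choice of 50 sticks are hypothetical and not part of the original method.

```python
import random

def stick_breaking_weights(alpha, num_sticks, rng):
    """Return the first `num_sticks` weights pi_h = V_h * prod_{l<h} (1 - V_l)
    together with the mass that has not been allocated yet."""
    weights = []
    remaining = 1.0  # prod_{l<h} (1 - V_l): mass left before breaking stick h
    for _ in range(num_sticks):
        v = rng.betavariate(1.0, alpha)  # V_h ~ Be(1, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining

rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=1.0, num_sticks=50, rng=rng)
```

The generated weights and the leftover mass always add up to one, which is why the infinite sequence of weights defines a proper discrete distribution.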

The standard approach to adapt expression (1) to
accommodate mixed domain data is to simply let
$f(B) = \int_{\Theta} \mathcal{K}(B; \theta)\, dP(\theta)$, for all $B \in \mathcal{B}(\mathcal{Y})$, where $\mathcal{K}(\cdot\,; \theta)$ is an
appropriate joint probability measure over $\{\mathcal{Y}, \mathcal{B}(\mathcal{Y})\}$
obeying a parametric law. Choosing such a joint law
is straightforward in simple cases. For example,
Hannah et al. (2011) rely on a joint exponential family
distribution formulated via a sequence of generalized
linear models. However, in general settings, explicitly
characterizing dependence within $\mathcal{K}(\cdot\,; \theta)$ is not at all
straightforward and it becomes convenient to rely on
a product measure (Dunson & Bhattacharya 2010):

$$\mathcal{K}(B; \theta) = \prod_{j=1}^{p} \mathcal{K}_j(B_j; \theta_j), \quad B = \bigotimes_{j=1}^{p} B_j, \quad B_j \in \mathcal{B}(\mathcal{Y}_j). \tag{4}$$

If we then choose $P \sim \mathrm{DP}(\alpha P_0)$ with $P_0 = \bigotimes_{j=1}^{p} P_{0j}$,
we obtain an identical hierarchical specification to (3),
but with the elements of $y_i = \{y_{ij}\}$ conditionally
independent given the cluster allocation index $C_i$.

As mentioned in §1, this conditional independence
assumption given a single latent class variable is the
nemesis of the joint DPM approach. We consider
more generally a multivariate $C_i = (C_{i1}, \ldots, C_{ip})^T \in
\{1, \ldots, \infty\}^p$, with separate but dependent indices
across the disparate data types. We let,

$$\mathrm{pr}(C_{i1} = h_1, \ldots, C_{ip} = h_p) = \pi_{h_1 \cdots h_p}, \quad \text{with } h_j = 1, \ldots, \infty, \; j = 1, \ldots, p, \tag{5}$$

where $\pi = \{\pi_{h_1 \cdots h_p}\}$ is an infinite $p$-way probability
tensor characterizing the joint probability mass
function of the multivariate cluster indices. It remains
to specify the prior for the probability tensor $\pi$, which
is considered next in §3.

3 PROBABILISTIC TENSOR 
FACTORIZATIONS 

3.1 PARAFAC Extension 

Suppose that $C_{ij} \in \{1, \ldots, d_j\}$, with $d_j$ the number of
possible levels of the $j$th cluster index. Then, assuming
that $C_i$ are observed unordered categorical variables,
Dunson & Xing (2009) proposed a probabilistic
Parafac factorization of the tensor $\pi$:

$$\pi = \sum_{h=1}^{\infty} \lambda_h\, \psi_h^{(1)} \otimes \cdots \otimes \psi_h^{(p)}, \tag{6}$$

where $\lambda = \{\lambda_h\}$ follows a stick-breaking process,
$\psi_h^{(j)} = (\psi_{h1}^{(j)}, \ldots, \psi_{h d_j}^{(j)})^T$ is a probability vector
specific to component $h$ and outcome $j$, and $\otimes$ denotes the
outer product.

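We focus primarily on generalizations of the Parafac
factorization to the case in which $C_i$ is unobserved and
can take infinitely many different levels. We let,

$$\mathrm{pr}(C_{i1} = c_1, \ldots, C_{ip} = c_p) = \pi_{c_1 \cdots c_p} = \sum_{h=1}^{\infty} \lambda_h \prod_{j=1}^{p} \psi^{(j)}_{h c_j},$$

$$\lambda_h = V_h \prod_{l<h} (1 - V_l), \quad V_h \sim \mathrm{Be}(1, \alpha),$$

$$\psi^{(j)}_{hr} = U^{(j)}_{hr} \prod_{s<r} \big(1 - U^{(j)}_{hs}\big), \quad U^{(j)}_{hr} \sim \mathrm{Be}(1, \beta_j). \tag{7}$$

A more compact notation for this factorization of the
infinite probability tensor $\pi$ is,

$$\pi = \sum_{h=1}^{\infty} \lambda_h \bigotimes_{j=1}^{p} \psi_h^{(j)}, \tag{8}$$

$$\lambda \sim \mathrm{Stick}(\alpha), \quad \psi_h^{(j)} \sim \mathrm{Stick}(\beta_j), \tag{9}$$

which takes the form of a stick-breaking mixture of
outer products of stick-breaking processes. This form
is carefully chosen so that the elements of $\pi$ are
stochastically larger in those cells having the smallest
indices, with rapid decreases towards zero as one
moves away from the upper right corner of the tensor.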

It can be shown that tensor realizations from the ITF
distribution are valid in the sense that they sum to 1
with probability 1. We can be flexible in terms of where
exactly these cluster indices occur in a hierarchical
Bayesian model. Next in §3.2, we formulate a generic
mixture model for MDM, where the ITF is used to
characterize the cluster indices of the parameters governing
the distributions of the disparate data types.
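To make the construction concrete, the following Python sketch draws a truncated version of the probability tensor for $p = 2$ and checks numerically that its total mass is close to one. The truncation levels, the helper names `stick` and `truncated_itf`, and the concentration values are assumptions made for illustration; the prior itself is defined without truncation.

```python
import random

def stick(concentration, size, rng):
    """First `size` stick-breaking weights with Be(1, concentration) breaks."""
    out, remaining = [], 1.0
    for _ in range(size):
        v = rng.betavariate(1.0, concentration)
        out.append(v * remaining)
        remaining *= 1.0 - v
    return out

def truncated_itf(alpha, betas, top_levels, levels, rng):
    """Truncated draw for p = 2:
    pi[a][b] = sum_h lambda_h * psi^(1)_{h,a} * psi^(2)_{h,b}."""
    lam = stick(alpha, top_levels, rng)
    # psi[j][h] is the weight vector of data type j within top-level component h
    psi = [[stick(betas[j], levels, rng) for _ in range(top_levels)]
           for j in range(2)]
    pi = [[sum(lam[h] * psi[0][h][a] * psi[1][h][b] for h in range(top_levels))
           for b in range(levels)]
          for a in range(levels)]
    return lam, psi, pi

rng = random.Random(1)
lam, psi, pi = truncated_itf(alpha=1.0, betas=(1.0, 1.0),
                             top_levels=60, levels=60, rng=rng)
total_mass = sum(sum(row) for row in pi)
```

With moderately large truncation levels the missing mass is negligible, which is consistent with the statement that realizations sum to 1 with probability 1.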



3.2 Infinite Tensor Factorization Mixture 

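Assume that for each individual $i$ we have a data
ensemble $(y_{i1}, \ldots, y_{ip}) \in \mathcal{Y}$ where $\mathcal{Y} = \bigotimes_{j=1}^{p} \mathcal{Y}_j$. Let
$\mathcal{B}(\mathcal{Y})$ be the sigma algebra generated by the
product sigma algebra $\mathcal{B}(\mathcal{Y}_1) \times \cdots \times \mathcal{B}(\mathcal{Y}_p)$. Consider any
Borel set $B = \bigotimes_{j=1}^{p} B_j \in \mathcal{B}(\mathcal{Y})$. Given cluster
indices $C_{i1} = h_1, \ldots, C_{ip} = h_p$, we assume that the
ensemble components are independent with

$$f(y_{i1} \in B_1, \ldots, y_{ip} \in B_p \mid C_{i1} = h_1, \ldots, C_{ip} = h_p) = \prod_{j=1}^{p} \mathcal{K}_j(B_j; \theta_{j,h_j}), \tag{10}$$

where $\mathcal{K}_j(\cdot\,; \theta_{j,h_j})$ is an appropriate probability measure on
$\{\mathcal{Y}_j, \mathcal{B}(\mathcal{Y}_j)\}$ as in equation (1). Marginalizing out the
cluster indices, we obtain

$$f(y_{i1} \in B_1, \ldots, y_{ip} \in B_p) = \sum_{h_1=1}^{\infty} \cdots \sum_{h_p=1}^{\infty} \pi_{h_1, \ldots, h_p} \prod_{j=1}^{p} \mathcal{K}_j(B_j; \theta_{j,h_j}), \tag{11}$$

where $\pi_{h_1, \ldots, h_p} = \mathrm{pr}(C_{i1} = h_1, \ldots, C_{ip} = h_p)$. We
let $\pi \sim \mathrm{ITF}(\alpha, \beta)$ and we call the resulting mixture
model an infinite tensor factorization mixture, $f \sim
\mathrm{ITM}(\alpha, \beta)$. To complete the model specification, we
let $\theta_{j,h_j} \sim P_{0j}$ independently as in (2).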

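The model $y_i \sim f$, $f \sim \mathrm{ITM}(\alpha, \beta)$, can be equivalently
expressed in hierarchical form as

$$y_{ij} \sim \mathcal{K}_j(\theta_{ij}), \quad (\theta_{i1}, \ldots, \theta_{ip}) \sim P, \quad P = \sum_{h_1=1}^{\infty} \cdots \sum_{h_p=1}^{\infty} \pi_{h_1 \cdots h_p}\, \delta_{(\theta_{1,h_1}, \ldots, \theta_{p,h_p})}, \quad \pi \sim \mathrm{ITF}(\alpha, \beta), \quad \theta_{j,h_j} \sim P_{0j}. \tag{12}$$

Here, $P$ is a joint mixing measure across the different
data types and is given an infinite tensor process prior,
$P \sim \mathrm{ITP}(\alpha, \beta, \bigotimes_{j=1}^{p} P_{0j})$. Marginalizing out the
random measure $P$, we obtain the same form as in (11).

The proposed infinite tensor process prior provides a
much more flexible generalization of existing priors for
discrete random measures, such as the Dirichlet
process or Pitman-Yor process.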
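The hierarchical form lends itself to forward simulation. Below is a minimal Python sketch that generates synthetic ensembles from an ITF mixture, assuming unit-variance Gaussian kernels for every data type and a $N(0, 5^2)$ base measure for the atoms; both choices, as well as the names `LazyStick` and `sample_itm`, are illustrative assumptions rather than part of the model specification. Sticks are broken only when a draw requires them, so no truncation level is fixed in advance.

```python
import random

class LazyStick:
    """Stick-breaking weights generated on demand."""
    def __init__(self, concentration, rng):
        self.concentration, self.rng = concentration, rng
        self.weights, self.remaining = [], 1.0

    def sample_index(self):
        """Draw an index with probability equal to its stick-breaking weight."""
        u, cumulative, h = self.rng.random(), 0.0, 0
        while True:
            if h == len(self.weights):  # extend the stick only when needed
                v = self.rng.betavariate(1.0, self.concentration)
                self.weights.append(v * self.remaining)
                self.remaining *= 1.0 - v
            cumulative += self.weights[h]
            # the second condition guards against floating point round-off
            if u < cumulative or 1.0 - cumulative < 1e-12:
                return h
            h += 1

def sample_itm(n, p, alpha, beta, rng):
    """Simulate n ensembles of p real-valued components from the mixture."""
    lam = LazyStick(alpha, rng)  # top-level weights lambda
    psi = {}                     # (j, h0) -> weights psi_{h0}^{(j)}
    atoms = {}                   # (j, k) -> theta_{j,k} from the assumed base measure
    data, labels = [], []
    for _ in range(n):
        h0 = lam.sample_index()
        row, cs = [], []
        for j in range(p):
            if (j, h0) not in psi:
                psi[(j, h0)] = LazyStick(beta, rng)
            k = psi[(j, h0)].sample_index()
            if (j, k) not in atoms:
                atoms[(j, k)] = rng.gauss(0.0, 5.0)
            row.append(rng.gauss(atoms[(j, k)], 1.0))  # assumed Gaussian kernel
            cs.append(k)
        data.append(row)
        labels.append((h0, tuple(cs)))
    return data, labels

rng = random.Random(3)
data, labels = sample_itm(n=200, p=2, alpha=1.0, beta=1.0, rng=rng)
```

Note that the atoms are indexed by data type and cluster label only, so two subjects with different top-level components can still share a cluster for one data type while differing on another.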

4 POSTERIOR INFERENCE 

4.1 Markov Chain Monte Carlo Sampling 

We propose a novel algorithm for efficient exact
MCMC posterior inference in the ITM model,
utilizing blocked and partially collapsed steps. We
adapt ideas from Walker (2007) and Papaspiliopoulos &
Roberts (2008) to derive slice sampling steps with
label switching moves, entirely avoiding truncation
approximations. Begin by defining the augmented
joint likelihood for an observation $y_i$, cluster
labels $c_i = (c_{i0}, c_{i1}, \ldots, c_{ip})$ and slice variables $u_i =
(u_{i0}, u_{i1}, \ldots, u_{ip})$ as

$$p(y_i, c_i, u_i \mid \lambda, \Psi, \theta) = 1(u_{i0} < \lambda_{c_{i0}}) \prod_{j=1}^{p} \mathcal{K}_j\big(y_{ij} \mid \theta^{(j)}_{c_{ij}}\big)\, 1\big(u_{ij} < \psi^{(j)}_{c_{i0} c_{ij}}\big). \tag{13}$$

It is straightforward to verify that on marginalizing
$u_i$ the model is unchanged, but including $u_i$ induces
full conditional distributions for the cluster indices
with finite support. Let $m_{0h} = \sum_{i=1}^{n} 1(c_{i0} = h)$
and $\mathcal{D}_0 = \{h : m_{0h} > 0\}$. Similarly define
$m_{jhk} = \sum_{i=1}^{n} 1(c_{i0} = h)\, 1(c_{ij} = k)$ and $\mathcal{D}_j = \{k :
\sum_{h=1}^{\infty} m_{jhk} > 0\}$, and let $k_j^* = \max(\mathcal{D}_j)$ for $0 \le j \le p$.
Define $U_0 = \{u_{i0} : 1 \le i \le n\}$, $C_0 = \{c_{i0} : 1 \le i \le n\}$,
$U_1 = \{u_{ij} : 1 \le i \le n, 1 \le j \le p\}$ and $C_1 = \{c_{ij} : 1 \le
i \le n, 1 \le j \le p\}$. The superscript $(-i)$ denotes that
the quantity is computed excluding observation $i$.
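The claim that marginalizing the slice variables leaves the model unchanged can be checked in one line. The following worked step is a sketch under the assumption that each slice variable lives on $(0, 1)$ with Lebesgue measure, which is the usual convention for this type of augmentation.

```latex
\begin{align*}
\int_0^1 \!\cdots\! \int_0^1 p(y_i, c_i, u_i \mid \lambda, \Psi, \theta)\,
    du_{i0} \cdots du_{ip}
 &= \Big[\int_0^1 1(u_{i0} < \lambda_{c_{i0}})\, du_{i0}\Big]
    \prod_{j=1}^{p} \mathcal{K}_j\big(y_{ij} \mid \theta^{(j)}_{c_{ij}}\big)
    \Big[\int_0^1 1\big(u_{ij} < \psi^{(j)}_{c_{i0} c_{ij}}\big)\, du_{ij}\Big] \\
 &= \lambda_{c_{i0}} \prod_{j=1}^{p} \psi^{(j)}_{c_{i0} c_{ij}}\,
    \mathcal{K}_j\big(y_{ij} \mid \theta^{(j)}_{c_{ij}}\big).
\end{align*}
```

Summing the last expression over the top-level label $c_{i0}$ recovers the tensor entry for $(c_{i1}, \ldots, c_{ip})$ multiplied by the product of kernels, which is the mixture form obtained before augmentation.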

1. Block update $(U_0, \lambda, \alpha)$

(a) Sample $(\alpha \mid C_0)$. Standard results (Antoniak
1974) give

$$p(\alpha \mid C_0) \propto p(\alpha)\, \alpha^{c}\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}$$

for $c = |\mathcal{D}_0|$, which can be sampled via
Metropolis-Hastings or using auxiliary variables
when $p(\alpha)$ is a mixture of Gamma
distributions (Escobar & West 1995).

(b) Sample $(\lambda \mid \alpha, C_0)$ by drawing $V_h \sim \mathrm{Beta}(1 +
m_{0h}, \alpha + \sum_{l > h} m_{0l})$ for $1 \le h \le k_0^*$ and
setting $\lambda_h = V_h \prod_{l<h} (1 - V_l)$.

(c) Label switching moves:

i. From $\mathcal{D}_0$ choose two elements
$h_1, h_2$ uniformly at random and
change their labels with probability
$\min\big(1, (\lambda_{h_1} / \lambda_{h_2})^{m_{0h_2} - m_{0h_1}}\big)$.

ii. Sample a label $h$ uniformly from
$1, 2, \ldots, k_0^*$ and propose to swap the
labels $h$, $h + 1$ and corresponding stick
breaking weights $V_h$, $V_{h+1}$. Accept with
probability $\min(1, a)$ where



1 



i(h=k*) 



(1 - V h ) m ° {h+1) 
(1 - V h+1 ) moh 



(d) Sample $(u_{i0} \mid c_{i0}, \lambda) \sim U(0, \lambda_{c_{i0}})$
independently for $1 \le i \le n$.



2. Update $C_0$. From (13) the relevant probabilities
are

$$\Pr(c_{i0} = h \mid u_i, c_i, \Psi, \lambda) \propto 1(u_{i0} < \lambda_h) \prod_{j=1}^{p} 1\big(u_{ij} < \psi^{(j)}_{h c_{ij}}\big). \tag{14}$$

However, it is possible to obtain more efficient
updates through partial collapsing, which allows us
to integrate over the lower level slice variables and
$\Psi$ instead of conditioning on them. Then we have



Pr(c iQ = k | u i0 ,Ci,C^ cx 1 (u i0 < X h ) 



n 



A-i) 

s>2 ^" l jks 



i-i) 
l jks 



(15) 



To determine the support of (15) we need to
ensure that $u_0^* = \min\{u_{i0} : 1 \le i \le n\}$ satisfies
$u_0^* > 1 - \sum_{l=1}^{k_0^*} \lambda_l$. If $\sum_{l=1}^{k_0^*} \lambda_l < 1 - u_0^*$ then draw
additional stick breaking weights $V_{k_0^*+1}, \ldots, V_{k_0^*+d}$
independently from $\mathrm{Beta}(1, \alpha)$ until $\sum_{l=1}^{k_0^*+d} \lambda_l >
1 - u_0^*$, ensuring that $\sum_{h=k_0^*+d+1}^{\infty} 1(u_{i0} < \lambda_h) = 0$
for all $1 \le i \le n$. Then the support of (15) is
contained within $1, 2, \ldots, k_0^* + d$ and we can compute
the normalizing constant exactly.

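3. Block update $(U_1, \Psi, \beta)$:

(a) Update $(\beta^{(j)} \mid \{c_{ij} : c_{i0} = r\}, C_0)$ for $1 \le
j \le p$, $1 \le r \le k_0^*$. If the concentration
parameter is shared across global clusters (that
is, $\beta_r^{(j)} = \beta^{(j)}$) then a straightforward
conditional independence argument gives

$$p\big(\beta^{(j)} \mid \{c_{ij} : c_{i0} = r\}, C_0\big) \propto p\big(\beta^{(j)}\big) \prod_{r \in \mathcal{D}_0} \big(\beta^{(j)}\big)^{s_{jr}}\, \frac{\Gamma\big(\beta^{(j)}\big)}{\Gamma\big(\beta^{(j)} + n_r\big)} \tag{16}$$

where $n_r = |\{i : c_{i0} = r\}|$ and $s_{jr} = |\{h :
m_{jrh} > 0\}|$. Note that terms with $n_r = 1$
(corresponding to top-level singleton
components) do not contribute, since
$\beta^{(j)} \Gamma(\beta^{(j)}) = \Gamma(\beta^{(j)} + 1)$. The updating scheme of
Escobar & West (1995) is simple to adapt here using
$|\mathcal{D}_0|$ independent auxiliary variables.

(b) For $r \in \mathcal{D}_0$ update $(\psi_r^{(j)} \mid C_0, C_1, \beta_r^{(j)})$
by drawing $U_{rh}^{(j)} \sim \mathrm{Beta}(1 + m_{jrh}, \beta^{(j)} +
\sum_{l > h} m_{jrl})$ for $1 \le h \le k_j^*$.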



(c) Label switching moves: For $1 \le j \le p$,

i. From $\mathcal{D}_j$ choose two elements $h_1, h_2$
uniformly at random and change their labels
with probability $\min(1, a)$ where



n 



hphi 
>,(i) 



ii. Sample a label $h$ uniformly from
$1, 2, \ldots, k_j^*$ and propose to swap the
labels $h$, $h + 1$ and corresponding stick
breaking weights. Accept with probability
$\min(1, a)$ where



i(fc= fc j) 



1 - U, 



U) 



n / ,,, 

h ev II - ^ r(h+1) 



(17) 



(d) Sample $(u_{ij} \mid c_i, \Psi) \sim U\big(0, \psi^{(j)}_{c_{i0} c_{ij}}\big)$
independently for $1 \le j \le p$, $1 \le i \le n$.

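4. Update $C_j$ for $1 \le j \le p$ independently. We have

$$\Pr(c_{ij} = k \mid y, \theta, u_{ij}, c_{i0}, \Psi) \propto \mathcal{K}_j\big(y_{ij}; \theta_k^{(j)}\big)\, 1\big(u_{ij} < \psi^{(j)}_{c_{i0} k}\big). \tag{18}$$

As in step 2 we determine the support of the
full conditional distribution as follows: Let $u_j^* =
\min\{u_{ij} : 1 \le i \le n\}$. For all $r \in \mathcal{D}_0$, if
$\sum_{h=1}^{k_j^*} \psi_{rh}^{(j)} < 1 - u_j^*$ then extend the stick
breaking measure $\psi_r^{(j)}$ by drawing $d_r$ new stick
breaking weights from the prior so that $\sum_{h=1}^{k_j^* + d_r} \psi_{rh}^{(j)} >
1 - u_j^*$. Draw $\theta^{(j)}_{k_j^*+1}, \ldots, \theta^{(j)}_{k_j^*+d} \sim p(\theta^{(j)})$
independently (where $d = \max\{d_r : r \in \mathcal{D}_0\}$). Then
update $c_{ij}$ from

$$\Pr(c_{ij} = k \mid y, \theta, u_{ij}, c_{i0}, \Psi) = \frac{\mathcal{K}_j\big(y_{ij}; \theta_k^{(j)}\big)\, 1\big(u_{ij} < \psi^{(j)}_{c_{i0} k}\big)}{\sum_{h=1}^{k_j^* + d} \mathcal{K}_j\big(y_{ij}; \theta_h^{(j)}\big)\, 1\big(u_{ij} < \psi^{(j)}_{c_{i0} h}\big)}. \tag{19}$$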
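Three of the simpler moves in the scheme above can be sketched in Python: redrawing the top-level weights and slice variables (steps 1(b) and 1(d)), extending a stick-breaking measure until the slice condition bounds the support, and drawing a data-type label from its finite full conditional as in (18) and (19). This is an illustrative sketch, not the authors' implementation: labels are 0-based, the kernel is assumed to be a unit-variance Gaussian, and all function names are hypothetical. The hyperparameter updates, label switching moves and the collapsed update of step 2 are omitted.

```python
import math
import random

def update_top_level(c0, alpha, rng):
    """Steps 1(b) and 1(d): redraw lambda on the represented sticks,
    then draw each slice variable u_i0 uniformly below lambda_{c_i0}."""
    k0 = max(c0) + 1
    counts = [c0.count(h) for h in range(k0)]
    lam, remaining, tail = [], 1.0, len(c0)
    for h in range(k0):
        tail -= counts[h]  # number of subjects with a label larger than h
        v = rng.betavariate(1.0 + counts[h], alpha + tail)
        lam.append(v * remaining)
        remaining *= 1.0 - v
    u0 = [rng.random() * lam[h] for h in c0]
    return lam, remaining, u0

def extend_sticks(weights, remaining, concentration, u_min, rng):
    """Add sticks from the prior until the unallocated mass is below u_min,
    so that no unrepresented weight can exceed any slice variable."""
    if u_min <= 0.0:
        raise ValueError("u_min must be positive")
    weights = list(weights)
    while remaining >= u_min:
        v = rng.betavariate(1.0, concentration)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining

def normal_pdf(y, mean, sd=1.0):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def update_label(y_ij, u_ij, psi_row, atoms, rng):
    """Step 4: kernel density times slice indicator, normalized over the
    finite set of labels whose weight exceeds the slice variable."""
    support = [k for k, w in enumerate(psi_row) if u_ij < w]
    dens = [normal_pdf(y_ij, atoms[k]) for k in support]
    total = sum(dens)
    probs = {k: d / total for k, d in zip(support, dens)}
    label = rng.choices(support, weights=[probs[k] for k in support], k=1)[0]
    return label, probs

rng = random.Random(7)
c0 = [0, 0, 1, 2, 2, 2]
lam, remaining, u0 = update_top_level(c0, alpha=1.0, rng=rng)
lam_ext, remaining_ext = extend_sticks(lam, remaining, 1.0, min(u0), rng)
label, probs = update_label(y_ij=2.7, u_ij=0.1,
                            psi_row=[0.5, 0.3, 0.15, 0.04],
                            atoms=[-3.0, 0.0, 3.0, 6.0], rng=rng)
```

In the last call the fourth label is excluded because its weight 0.04 falls below the slice variable 0.1, which illustrates how the slice variables turn an infinite support into a finite one without any approximation.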



4.2 Inference 

Given samples from the MCMC scheme above we can
estimate the predictive distribution as

$$f(y_{n+1} \mid y^n) \approx \frac{1}{T} \sum_{t=1}^{T} \sum_{h_0=1}^{k_0^*} \sum_{h_1=1}^{k_1^*} \cdots \sum_{h_p=1}^{k_p^*} \lambda^{(t)}_{h_0} \prod_{j=1}^{p} \psi^{(j)(t)}_{h_0 h_j}\, \mathcal{K}_j\big(y_{n+1,j}; \theta^{(j)(t)}_{h_j}\big). \tag{20}$$

Each of the inner sums in (20) is a truncation
approximation, but it can be made arbitrarily precise
by extending the stick breaking measures with draws
from the prior and drawing corresponding atoms from
$p(\theta^{(j)})$. In practice this usually isn't necessary as any
error in the approximation is small relative to Monte
Carlo error.
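A Monte Carlo average of this kind can be computed without nesting the sums over all data types, because for a fixed top-level component the sum over each data-type index factorizes. The sketch below assumes unit-variance Gaussian kernels and a hypothetical storage layout in which each saved draw is a tuple `(lam, psi, atoms)` with `psi[j][h0]` a list of weights and `atoms[j]` a list of kernel means; none of these names come from the paper.

```python
import math

def normal_pdf(y, mean, sd=1.0):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(y_new, lam, psi, atoms):
    """Truncated mixture density for one posterior draw, computed as
    sum_h0 lambda_h0 * prod_j ( sum_k psi[j][h0][k] * K_j(y_j; atoms[j][k]) )."""
    total = 0.0
    for h0, weight in enumerate(lam):
        term = weight
        for j, y in enumerate(y_new):
            term *= sum(w * normal_pdf(y, atoms[j][k])
                        for k, w in enumerate(psi[j][h0]))
        total += term
    return total

def predictive_density(y_new, draws):
    """Average the per-draw mixture density over the saved MCMC draws."""
    return sum(mixture_density(y_new, *draw) for draw in draws) / len(draws)

draws = [
    ([1.0], [[[1.0]], [[1.0]]], [[0.0], [0.0]]),
    ([0.5, 0.5],
     [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]],
     [[0.0, 4.0], [0.0, 4.0]]),
]
value = predictive_density((0.0, 0.0), draws)
```

The factorized form costs on the order of the number of top-level components times the total number of data-type clusters per draw, rather than growing exponentially in the number of data types.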

The other common inferential question of interest in
the MDM setting is the dependence between
components, for example testing whether components $j_1$
and $j_2$ are independent of each other. As already
noted, the dependence between the components comes
in through the dependence between the cluster
allocations, and therefore testing for independence between
$j_1$ and $j_2$ is equivalent to testing for independence
between their latent cluster indicators $C_{j_1}$ and $C_{j_2}$.
Such a test can be constructed in terms of the
divergence between the joint and marginal posterior
distributions of $C_{j_1}$ and $C_{j_2}$. The Monte Carlo estimate of
the Kullback-Leibler divergence between the joint and
marginal posterior distributions is given as,

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{h_{j_1}=1}^{k_{j_1}^*} \sum_{h_{j_2}=1}^{k_{j_2}^*} \Big[ \sum_{h_0=1}^{k_0^*} \lambda^{(t)}_{h_0} \psi^{(j_1)(t)}_{h_0 h_{j_1}} \psi^{(j_2)(t)}_{h_0 h_{j_2}} \Big] \times \log \frac{\sum_{h_0=1}^{k_0^*} \lambda^{(t)}_{h_0} \psi^{(j_1)(t)}_{h_0 h_{j_1}} \psi^{(j_2)(t)}_{h_0 h_{j_2}}}{\Big[\sum_{h_0=1}^{k_0^*} \lambda^{(t)}_{h_0} \psi^{(j_1)(t)}_{h_0 h_{j_1}}\Big] \Big[\sum_{h_0=1}^{k_0^*} \lambda^{(t)}_{h_0} \psi^{(j_2)(t)}_{h_0 h_{j_2}}\Big]}. \tag{21}$$

Under independence, the divergence should be 0.
Analogous divergences can be considered for testing
other general dependencies, such as 3-way or 4-way
independence.
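For a single posterior draw, the divergence between the joint distribution of two cluster indicators and the product of its marginals can be computed as follows. This Python sketch assumes truncated weight vectors stored as plain lists (`lam[h]`, `psi_a[h][k]`, `psi_b[h][k]`) and renormalizes the truncated joint table; the function name `pairwise_kl` and the storage layout are hypothetical. In practice the value would be averaged over the saved MCMC draws.

```python
import math

def pairwise_kl(lam, psi_a, psi_b):
    """KL divergence between the joint pmf of two cluster indices,
    joint[a][b] = sum_h lam[h] * psi_a[h][a] * psi_b[h][b],
    and the product of its two marginals."""
    ka, kb = len(psi_a[0]), len(psi_b[0])
    joint = [[sum(lam[h] * psi_a[h][a] * psi_b[h][b] for h in range(len(lam)))
              for b in range(kb)] for a in range(ka)]
    total = sum(sum(row) for row in joint)
    joint = [[x / total for x in row] for row in joint]  # renormalize after truncation
    marg_a = [sum(row) for row in joint]
    marg_b = [sum(joint[a][b] for a in range(ka)) for b in range(kb)]
    kl = 0.0
    for a in range(ka):
        for b in range(kb):
            if joint[a][b] > 0.0:
                kl += joint[a][b] * math.log(joint[a][b] / (marg_a[a] * marg_b[b]))
    return kl

# One top-level component: the two indices are independent.
kl_independent = pairwise_kl([1.0], [[0.5, 0.5]], [[0.3, 0.7]])
# Two top-level components that move both indices together: fully dependent.
kl_dependent = pairwise_kl([0.5, 0.5],
                           [[1.0, 0.0], [0.0, 1.0]],
                           [[1.0, 0.0], [0.0, 1.0]])
```

The two toy cases show the extremes: a single top-level component yields zero divergence, while two components that determine both indices yield a divergence of $\log 2$.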



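5. Update $(\theta \mid -)$ by drawing from

$$p\big(\theta_h^{(j)} \mid -\big) \propto p\big(\theta_h^{(j)}\big) \prod_{\{i : c_{ij} = h\}} \mathcal{K}_j\big(y_{ij}; \theta_h^{(j)}\big)$$

for each $1 \le j \le p$ and $1 \le h \le k_j^*$.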



5 EXPERIMENTS 

Our approach can be used for two different objectives 
in the context of mixed domain data - for prediction 
and for inference on the dependence structure between 
different data types. We outline results of experiments
with both simulated and real data that show the
performance of our approach with respect to both
objectives.

5.1 Simulated Data Examples 

To the best of our knowledge, there is no standard 
model to jointly predict for mixed domain data as well 
as evaluate the dependence structure, so as a competi- 
tor, we use a joint DPM. To keep the evaluations fair, 
we use two scenarios. In the first the ground truth is 
close to that of the joint DPM, in the sense that all the 
components of the mixed data have the same cluster 
structure. The other simulated experiment considers 
the case when the ground truth is close to the ITF, 
where different components of the mixed data ensem- 
ble have their own cluster structure but clustering is 
dependent. The goal here in each of the scenarios is 
to compare joint DPM vs ITF in terms of recovery of 
dependence structure and predictive accuracy. 

For scenario 1, we consider a set of 1,000 individuals
from whom an ensemble has been collected comprising T,
a time series; R, a multivariate real-valued response
($\in \mathbb{R}^4$); and C1, C2, C3, three different categorical
variables. This emulates the type of data collected from
patients in cancer studies and other medical
evaluations. For the purposes of scenario 1, we simulate T,
R, C1, C2, C3 each from a mixture of 3 clusters. For
example, R is simulated from a two-component
mixture of multivariate normals with different means, T
is simulated from a mixture of two autoregressive
kernels and each of the categorical variables from a
mixture of two multinomial distributions. If we label the
clusters as 1 and 2, for each simulation, either all of
the ensemble (T, R, C1, C2, C3) comes from 1 or all of
it comes from 2. After simulation we randomly hold
out R in 50 individuals, and C1, C2 in 10 each, for the
purposes of measuring prediction accuracy. For the
categorical variables prediction accuracy is considered
with a 0-1 loss function and is expressed as a percent
misclassification rate. For the multivariate real
variable R, we consider squared error loss and accuracy is
expressed as relative predictive error. We also evaluate
for some of the pairs their dependence via estimated
mutual information.

For scenario 2, the same set-up as in scenario 1 is 
used, except for the cluster structure of the ensem- 
ble. Now simulations are done such that T falls into 
three clusters and this is dependent on R and C1. C2 
and C3 depend on each other and are simulated from 
two clusters each but their clustering is independent 
of the other variables in the ensemble. We measure 
prediction accuracy using a hold out set of the same 
size as in scenario 1 and also evaluate the dependence 
structure from the ITF model. 



In each case, we take 100,000 iterations of the MCMC
scheme with the first few thousand discarded as burn-in.
Prediction errors are reported in Tables 1 and 2 (top),
and the recovered dependence structure is summarized in
the same tables (bottom). In scenario 1, the prediction
accuracy of the ITF and the DPM are comparable, with
the DPM performing marginally better in a couple of cases.
Note that the dependence structure recovered by the ITF
is exactly accurate, which shows that the ITF can reduce to
joint co-clustering when that is the truth. In scenario
2, however, the ITF improves substantially on the DPM
in predictive accuracy. In fact, the predictions from
the DPM for the categorical variables are close to noise.
The dependence structure recovered by the ITF closely
reflects the truth, whereas the DPM predicts that every
pair is dependent, by virtue of its construction.

5.2 Real Data Examples 

For generic real mixed domain data the dependence
structure is wholly unknown. To evaluate how well
the ITF does in capturing pairwise dependencies, we
first consider a network example in which recovering
dependencies is of principal interest and prediction is
not relevant. We consider data comprising 105
political blogs (Adamic & Glance 2005) where the edges
in the graph are composed of the links between
websites. Each blog is labeled with its ideology, and we
also have the source(s) which were used to determine
this label. Our model includes the network, ideology
label, and binary indicators for 7 labeling sources
(including "manually labeled", which is thought to be
the most subject to labeling errors). We assume
that ideology impacts links through cluster assignment
only, which is a reasonable assumption here. We
collect 100,000 MCMC iterations after a short burn-in
and save the iterate with the largest complete-data
likelihood for exploratory purposes.

Fig. 1 shows the network structure, with nodes colored
by ideology. It is immediately clear that there is
significant clustering, apparently driven largely by ideology,
but that ideology alone does not account for all the
structure present in the graph. A joint DPM approach
would allow for only one type of clustering and prevent
us from exploring this additional structure. The
recovered clustering in Fig. 2 reveals a number of interesting
structural properties of the graph; for example, we see
a tight cluster of conservative blogs which have high
in- and out-degrees but do not link to one another
(green) and a partitioning of the liberal blogs into a
tightly connected component (purple) and a periphery
component with low degree (blue). The conservative
blogs do not exhibit the same level of assortative
mixing (propensity to link within a cluster) as the liberal
blogs do, especially within the purple component.

To get a sense for how stable the clustering is, we esti- 
mate the posterior probability that nodes i and j are 
assigned to the same cluster by recording the number 
of times this event occurs in the MCMC. We observe 
that the clusters are generally quite stable, with two 
notable exceptions. First, there is significant posterior 
probability that points 90 and 92 are assigned to the 
red cluster rather than the blue cluster. This is sig- 
nificant because these two points are the conservative 
blogs which are connected only to liberal blogs (see Fig.
1). While the graph topology strongly suggests that 
these belong to the blue cluster, the labels are able to 
exert some influence as well. Note that we do not ob- 
serve the same phenomenon for points 7, 15, and 25, 
which are better connected. We also observe some am- 
biguity between the purple and blue clusters. These 
are nodes 6, 14, 22, 33, 35 and 36, which appear at the 
intersection of the purple/blue clusters in the graph 
projection because they are not quite as connected as 
the purple "core" but better connected than most of 
the blue clusters. 
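The pairwise co-assignment probabilities described above reduce to counting, over the saved draws, how often two nodes share a label. A minimal Python sketch, assuming each saved draw is stored as a list of cluster labels (one per node); the function name `coclustering_matrix` and the toy draws are hypothetical.

```python
def coclustering_matrix(label_draws):
    """Fraction of saved MCMC draws in which nodes i and j share a cluster label."""
    n = len(label_draws[0])
    counts = [[0] * n for _ in range(n)]
    for labels in label_draws:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    num_draws = len(label_draws)
    return [[c / num_draws for c in row] for row in counts]

# Four toy draws for three nodes; label values need not match across draws.
draws = [[0, 0, 1], [0, 1, 1], [2, 2, 0], [1, 1, 1]]
probs = coclustering_matrix(draws)
```

Because only equality of labels within a draw is used, the estimate is unaffected by label switching across iterations.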

Finally, we examine the posterior probability of being
labeled "conservative" (Fig. 3). Most data points are
assigned very high or low probability. The five labeled
points stand out as having uncharacteristic labels for
their link structure (see Fig. 1). Since the observed label
doesn't agree with the graph topology, the probability
is pulled away from 0/1 toward a more conservative 
value. This effect is most pronounced in the three 
better-connected liberal blogs (lower left) versus the 
weakly connected conservative blogs (upper right). 

For the second example, we use data obtained from
the Osteoarthritis Initiative (OAI) database, which is
available for public access at http://www.oai.ucsf.edu/.
The question of interest for these data is to
investigate relationships between physical activity and knee
disease symptoms. For this example we use a subset
of the baseline clinical data, version 0.2.2. The data
ensemble comprises variables including biomarkers,
knee joint symptoms, medical history, nutrition,
physical exam and subject characteristics. In our subset
we take an ensemble of size 120 for 4750 individuals.
We hold out some of the biomarkers and knee joint
symptoms and consider prediction accuracy of the ITF
versus the joint DPM model. For the real variables,
mixtures of normal kernels are considered; for the
categorical variables, mixtures of multinomials; and for the
time series, mixtures of fixed finite wavelet basis expansions.

Results for this experiment are summarized in Table 3
for 4 held-out variables. The ITF outperforms the DPM
in 3 of these 4 cases and has marginally worse prediction
accuracy for the remaining variable. It is also
interesting to note that the ITF helps to uncover useful
relationships between medical history, physical activity
and knee disease symptoms, which has potential
application for clinical action and treatments at
subsequent patient visits.

6 CONCLUSIONS 

We have developed a general model to accommodate 
complex ensembles of data, along with a novel
algorithm to sample from the posterior distributions
arising from the model. Theoretically, extension to any
number of levels of stick breaking processes should be
possible; the utility and computational feasibility of
such extensions are being studied. Also under
investigation are connections with random graph/network
models and theoretical rates of posterior convergence.

Table 1: Simulation Example, Scenario 1: Prediction
error (top), tests of independence (bottom)

            ITF     DPM
T           1.79    1.43
C2          31%     23%
C3          37%     36%

            ITF     DPM     "Truth"
C1 vs T     Yes     Yes     Yes
C2 vs T     Yes     Yes     Yes
C3 vs T     Yes     Yes     Yes
C2 vs R     Yes     Yes     Yes


Table 2: Simulation Example, Scenario 2: Prediction
error (top), tests of independence (bottom)

            ITF     DPM
T           4.61    10.82
C2          27%     55%
C3          34%     57%

            ITF     DPM     "Truth"
C1 vs T     Yes     Yes     Yes
C2 vs T     No      Yes     No
C3 vs T     No      Yes     No
C2 vs R     No      Yes     No

Bayesian learning of joint distributions of objects




Figure 1: Network Example: True Clustering

Figure 3: Network Example: Pairwise
cluster assignment probability. Left bars
correspond to clustering in Fig. 2; top
bars correspond to clustering on the
ideology label.

Table 3: OAI Data example: Relative Predictive
Accuracy. The variables are, respectively, left knee baseline
pain, isometric strength left knee extension, left knee
paired X ray reading, left knee baseline radiographic
OA.

                ITF      DPM
P01BL12SXL      31.21    100.92
V00LEXWHY1      7.94     7.56 %
V00XRCHML       23.01    31.84 %
P01LXRKOA       65.78    90.30 %

Figure 2: Network Example: Recovered Clustering